An OCR Post-Correction Approach Using Deep Learning for Processing Medical Reports
Abstract
According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge strain on the global health care sector. COVID-19 has also catalysed digital transformation across the sector to improve operational efficiencies. As a result, the amount of digitally stored patient data, such as discharge letters, scan images, test results or free-text entries by doctors, has grown significantly. In 2020, 2314 exabytes of medical data were generated globally. This data does not conform to a generic structure and is mostly in the form of unstructured scanned paper documents stored as part of a patient's medical reports. These documents are digitised using an Optical Character Recognition (OCR) process. A key challenge here is that the accuracy of the OCR process varies due to the inability of current OCR engines to correctly transcribe handwritten text, which may be skewed, obscured or illegible. This is compounded by the fact that the processed text is comprised of specific medical terminologies that do not necessarily form part of general language lexicons. The proposed work uses a deep neural network based self-supervised pre-training technique, Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa), which can learn to predict hidden (masked) sections of text and so fill in the gaps left by non-transcribable parts of the documents being processed. Evaluating the proposed method on domain-specific datasets, which include real medical documents, shows a significantly reduced word error rate, demonstrating the effectiveness of the approach.
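The two ideas the abstract relies on, masking the parts of the OCR output that could not be transcribed and scoring the result by word error rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the per-token confidence scores, the 0.5 threshold, the sample sentence and the function names are assumptions made for illustration. The only detail taken from RoBERTa itself is its mask token, `<mask>`. The masked string is what a masked language model would then be asked to complete.

```python
from typing import List, Sequence, Tuple

MASK_TOKEN = "<mask>"  # mask token used by RoBERTa tokenizers


def mask_low_confidence(tokens: Sequence[Tuple[str, float]],
                        threshold: float = 0.5) -> str:
    """Replace OCR tokens whose confidence is below `threshold` with MASK_TOKEN.

    `tokens` is a hypothetical (word, confidence) list; real OCR engines
    expose confidences in engine-specific formats.
    """
    return " ".join(word if conf >= threshold else MASK_TOKEN
                    for word, conf in tokens)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical OCR output with one poorly recognised word.
    ocr_tokens = [("patient", 0.98), ("dischrged", 0.31),
                  ("with", 0.95), ("amoxicillin", 0.88)]
    print(mask_low_confidence(ocr_tokens))  # patient <mask> with amoxicillin
    print(word_error_rate("patient discharged with amoxicillin",
                          "patient dischrged with amoxicillin"))  # 0.25
```

In this sketch, word error rate is the number of word substitutions, insertions and deletions needed to turn the OCR output into the reference, divided by the reference length, so one wrong word out of four gives 0.25.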
Similar resources
An OCR Post-processing Approach Based on Multi-knowledge
This paper proposes an OCR post-processing approach based on multi-knowledge, which integrates language knowledge and candidate distance information given by the OCR engine. In this approach, statistical language model and semantic lexicon are combined, and candidate distance information is used to reduce the size of the search space. The experimental results show that this approach is very eff...
Telugu OCR Framework using Deep Learning
In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, which is the most challenging task of the three, is a deep convolutional neural networ...
OCR Post-Processing for Low Density Languages
We present a lexicon-free post-processing method for optical character recognition (OCR), implemented using weighted finite state machines. We evaluate the technique in a number of scenarios relevant for natural language processing, including creation of new OCR capabilities for low density languages, improvement of OCR performance for a native commercial system, acquisition of knowledge from a...
OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition, was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
Journal
Journal title: IEEE Transactions on Circuits and Systems for Video Technology
Year: 2022
ISSN: 1051-8215, 1558-2205
DOI: https://doi.org/10.1109/tcsvt.2021.3087641